 bar exam


GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations

arXiv.org Artificial Intelligence

We introduce GreekBarBench, a benchmark that evaluates LLMs on legal questions across five different legal areas from the Greek Bar exams, requiring citations to statutory articles and case facts. To tackle the challenges of free-text evaluation, we propose a three-dimensional scoring system combined with an LLM-as-a-judge approach. We also develop a meta-evaluation benchmark to assess the correlation between LLM-judges and human expert evaluations, revealing that simple, span-based rubrics improve their alignment. Our systematic evaluation of 13 proprietary and open-weight LLMs shows that even though the best models outperform average expert scores, they fall short of the 95th percentile of experts.


After exam fiasco, California State Bar faces deeper financial crisis

Los Angeles Times

The California State Bar's botched rollout of a new exam -- a move that the cash-strapped agency made in the hopes of saving money -- could ultimately end up costing it an additional $5.6 million. Leah T. Wilson, executive director of the State Bar, told state lawmakers at a Senate Judiciary hearing Tuesday that the agency expects to pay around $3 million to offer free exams to test takers, an additional $2 million to book in-person testing sites in July, and $620,000 to return the test to its traditional system of multiple-choice questions in July. Wilson, who announced last week she will step down when her term ends this summer, revealed the costs during a 90-minute hearing called by Sen. Thomas J. Umberg (D-Orange), chair of the Senate Judiciary Committee, to find out what went so "spectacularly wrong." Chaos ensued in February when thousands of test takers seeking to practice law in California sat for the new exam. Some reported they couldn't log into the exam because online testing platforms repeatedly crashed.


Head of State Bar of California to step down after exam fiasco

Los Angeles Times

The State Bar of California announced Friday that its embattled leader, who has faced growing pressure to resign over the botched February rollout of a new bar exam, will step down in July. Leah T. Wilson, the agency's executive director, informed the Board of Trustees she will not seek another term in the position she has held on and off since 2017. She also apologized for her role in the February bar exam chaos. "Accountability is a bedrock principle for any leader," Wilson said in a statement. "At the end of the day, I am responsible for everything that occurs within the organization. Despite our best intentions, the experiences of applicants for the February Bar Exam simply were unacceptable, and I fully recognize the frustration and stress this experience caused. While there are no words to assuage those emotions, I do sincerely apologize."


Pressure grows on State Bar of California to revert to national exam format in July after botched exam

Los Angeles Times

An influential California legislator is pressuring the State Bar of California to ditch its new multiple-choice questions after a February bar exam debacle and revert to the traditional test format in July. "Given the catastrophe of the February bar, I think that going back to the methods that have been used for the last 50 years -- until we can adequately test what new methods may be employed -- is the appropriate way to go," Sen. Tom Umberg (D-Orange), chair of the state Senate Judiciary Committee, told The Times. Thousands of test takers seeking to practice law in California typically take the two-day bar exam in July. Reverting to the national system by the National Conference of Bar Examiners, which California has used since 1972, would be a major retreat for the embattled State Bar. Its new exam was rolled out this year as a cost-cutting measure and "historic agreement" that would offer test takers the choice of remote testing.


Developing a Pragmatic Benchmark for Assessing Korean Legal Language Understanding in Large Language Models

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated remarkable performance in the legal domain, with GPT-4 even passing the Uniform Bar Exam in the U.S. However, their efficacy remains limited for non-standardized tasks and tasks in languages other than English. This underscores the need for careful evaluation of LLMs within each legal system before application. Here, we introduce KBL, a benchmark for assessing the Korean legal language understanding of LLMs, consisting of (1) 7 legal knowledge tasks (510 examples), (2) 4 legal reasoning tasks (288 examples), and (3) the Korean bar exam (4 domains, 53 tasks, 2,510 examples). The first two datasets were developed in close collaboration with lawyers to evaluate LLMs in practical scenarios in a certified manner. Furthermore, considering legal practitioners' frequent use of extensive legal documents for research, we assess LLMs in both a closed-book setting, where they rely solely on internal knowledge, and a retrieval-augmented generation (RAG) setting, using a corpus of Korean statutes and precedents. The results indicate substantial room and opportunities for improvement.


Beyond Turing: Testing LLMs for Intelligence

Communications of the ACM

In the nearly two years since its release, ChatGPT has shown some remarkably human-like behavior, from trying to seduce a journalist to acing the bar exam. That has left some people wondering whether computers are approaching human levels of intelligence. Most computer scientists do not think machines are the intellectual equals of people yet, but they have not developed a consensus on how to measure intelligence, or what exactly to measure. The canonical experiment to check for machine intelligence is the Turing test, proposed by Alan Turing in his 1950 paper "Computing Machinery and Intelligence." Turing argues that if a computer could convince a person having a typed conversation with it that it was human, that might be a sign of intelligence.


Don't Let Mistrust of Tech Companies Blind You to the Power of AI

WIRED

It seems evident to me that almost 70 years after the first conference on artificial intelligence--where the nascent field's leaders suggested the task would be completed within a decade--the field is now poised to make a transformational impact on our lives. We don't need to reach artificial general intelligence, or AGI, whatever that means, for this to happen. I wrote as much in this column three weeks ago, citing evidence that after the astonishing leap of large language models that gave us ChatGPT, the advancements had not "plateaued" as some critics were charging. I also disagreed with the wave of skeptics claiming that what looked amazing in OpenAI's GPT-4, Anthropic's Claude 3, Meta's Llama 3, and an armada of Microsoft Copilots was merely a linguistic variation of a card trick. The hype, I insisted, is justified.


OpenAI's GPT-4 exhibits "human-level performance" on professional benchmarks

#artificialintelligence

On Tuesday, OpenAI announced GPT-4, a large multimodal model that can accept text and image inputs while returning text output that "exhibits human-level performance on various professional and academic benchmarks," according to OpenAI. Also on Tuesday, Microsoft announced that Bing Chat has been running on GPT-4 all along. If it performs as claimed, GPT-4 potentially represents the opening of a new era in artificial intelligence. "It passes a simulated bar exam with a score around the top 10% of test takers," writes OpenAI in its announcement. OpenAI plans to release GPT-4's text capability through ChatGPT and its commercial API, but with a waitlist at first.


ChatGPT Can Pass the Bar Exam Now. So What? - CNET

#artificialintelligence

When I was studying journalism at university, we had an assignment called News Day, designed to replicate a day in the life of a reporter. You arrived at school in the morning and were assigned a story to be filed by the end of the day. I've forgotten what my specific story assignment was -- it was 12 years ago -- only that it had something to do with climate change. What I do remember, with painful lucidity, is an interview with an academic who'd agreed to help me. After about 10 minutes, he correctly intuited from my questions that I didn't understand the issue, whatever it was.


What is GPT-4 and how does it differ from ChatGPT?

#artificialintelligence

OpenAI's latest release, GPT-4, is the most powerful and impressive AI model yet from the company behind ChatGPT and the Dall-E AI artist. The system can pass the bar exam, solve logic puzzles, and even give you a recipe to use up leftovers based on a photo of your fridge – but its creators warn it can also spread fake facts, embed dangerous ideologies, and even trick people into doing tasks on its behalf. Here's what you need to know about our latest AI overlord. GPT-4 is, at heart, a machine for creating text. But it is a very good one, and to be very good at creating text turns out to be practically similar to being very good at understanding and reasoning about the world. And so if you give GPT-4 a question from a US bar exam, it will write an essay that demonstrates legal knowledge; if you give it a medicinal molecule and ask for variations, it will seem to apply biochemical expertise; and if you ask it to tell you a gag about a fish, it will seem to have a sense of humour – or at least a good memory for bad cracker jokes ("what do you get when you cross a fish and an elephant?